('2025-08-21 13:34:30', 1755772470.672871)

TinyML-Autopilot Processor Comparison Analysis

This notebook compares the performance of the PSG and TPUSG processors across models and parameter configurations.

Core Analysis Framework:

  • 2 Processors: PSG vs TPUSG
  • 5 Models: qwen32b, qwen14b, phi4, gemma3:27b, codestral
  • 2 Conditions: With Parameters vs Without Parameters
  • Goal: Determine which processor performs better under different configurations

Define Omissive Errors

Combine CSV Files

Found 62 CSV files to combine

Combined data saved to: /home/han/Projects/reference-benchmark-tinyml_llm/combined_tinyml_benchmark_data.csv
Total rows: 1774
Total unique batch_ids: 61
'/home/han/Projects/reference-benchmark-tinyml_llm/combined_tinyml_benchmark_data.csv'
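The combining step can be sketched as follows. This is a minimal, standard-library-only illustration, not the notebook's actual code (which is not shown and may well use pandas). The function name `combine_csv_files` and its arguments are hypothetical.

```python
import csv
import glob
import os


def combine_csv_files(input_dir, output_path):
    """Concatenate every *.csv in input_dir into output_path.

    Returns (number_of_input_files, number_of_data_rows).
    Hypothetical helper; the notebook's own implementation is not shown.
    """
    out_abs = os.path.abspath(output_path)
    # Skip the combined file itself so re-running does not duplicate rows.
    paths = [
        p for p in sorted(glob.glob(os.path.join(input_dir, "*.csv")))
        if os.path.abspath(p) != out_abs
    ]
    rows, fieldnames = [], []
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)  # keep first-seen column order
            rows.extend(reader)
    with open(output_path, "w", newline="") as handle:
        # restval fills columns that are missing from some input files.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(paths), len(rows)
```

Calling `combine_csv_files(some_dir, some_output)` returns the file and row counts, which correspond to the "Found ... CSV files" and "Total rows" lines in the log.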

Assigning 20 Categories According to Processors, Models, and Parameters

Removing 83 rows due to skipped_error set.
Dataset loaded successfully!
Shape: (1691, 20)
Final shape: (1571, 16)
Processor distribution: {'psg': 827, 'tpusg': 744}
Parameter distribution: {True: 853, False: 718}
Category distribution: {'psg-qwen32b-True': 136, 'tpusg-phi4-True': 120, 'tpusg-qwen32b-True': 118, 'psg-phi4-True': 90, 'tpusg-qwen32b-False': 90, 'psg-qwen14b-False': 90, 'psg-codestral-True': 90, 'psg-phi4-False': 90, 'psg-qwen14b-True': 87, 'tpusg-codestral-True': 87, 'tpusg-phi4-False': 82, 'tpusg-qwen14b-False': 75, 'psg-codestral-False': 72, 'psg-gemma3:27b-False': 60, 'psg-qwen32b-False': 60, 'tpusg-codestral-False': 56, 'psg-gemma3:27b-True': 52, 'tpusg-gemma3:27b-True': 43, 'tpusg-gemma3:27b-False': 43, 'tpusg-qwen14b-True': 30}
   num_run                       name              batch_id   status  latency  total_tokens  prompt_tokens  completion_tokens  parameters  generation_count                                               tags   timestamp  test_date  model_config  processor              category
0        1  e2fa_tpu_sketch_generator  codestral_34a5_tpusg  failure   110.76         13175          10240               2935        True                 5  ['benchmark', 'codestral:latest', 'tpu_sketch_...  1755697779      08.24     codestral      tpusg  tpusg-codestral-True
1        2  4a9e_tpu_sketch_generator  codestral_34a5_tpusg  failure   100.66         13827          10240               3587        True                 5  ['benchmark', 'codestral:latest', 'tpu_sketch_...  1755697904      08.24     codestral      tpusg  tpusg-codestral-True
2        3  d3b0_tpu_sketch_generator  codestral_34a5_tpusg  failure    48.59         11737          10240               1497        True                 5  ['benchmark', 'codestral:latest', 'tpu_sketch_...  1755698034      08.24     codestral      tpusg  tpusg-codestral-True
3        4  8c83_tpu_sketch_generator  codestral_34a5_tpusg  failure    99.30         13766          10240               3526        True                 5  ['benchmark', 'codestral:latest', 'tpu_sketch_...  1755698097      08.24     codestral      tpusg  tpusg-codestral-True
4        5  05c6_tpu_sketch_generator  codestral_34a5_tpusg  failure   119.65         14585          10240               4345        True                 5  ['benchmark', 'codestral:latest', 'tpu_sketch_...  1755698227      08.24     codestral      tpusg  tpusg-codestral-True
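The `category` labels in the output (for example `tpusg-codestral-True`) look like the processor, model, and parameter flag joined with hyphens. A minimal sketch of that labelling, assuming this is how the column is built; `make_category` and `category_distribution` are hypothetical names:

```python
from collections import Counter


def make_category(processor, model_config, parameters):
    """Join the three grouping keys into one label, e.g. 'psg-phi4-False'."""
    return f"{processor}-{model_config}-{parameters}"


def category_distribution(runs):
    """Count runs per category.

    runs: iterable of (processor, model_config, parameters) tuples.
    With 2 processors, 5 models, and 2 parameter settings there are at most
    2 * 5 * 2 = 20 distinct categories.
    """
    return Counter(make_category(*run) for run in runs)
```

`Counter.most_common()` then yields the categories in descending order of run count, the same ordering used in the distribution printed above.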

Grouping and Aggregating Test Batches

Total runs: 1571 (PSG: 827, TPUSG: 744)
Models: codestral, gemma3:27b, phi4, qwen14b, qwen32b
Parameter conditions: P (853 runs) vs NP (718 runs)

📈 Complete Processor Comparison Matrix:
------------------------------------------------------------
processor                  psg  tpusg
model_config parameters              
codestral    False        50.0   28.6
             True         10.0   11.5
gemma3:27b   False        53.3    2.3
             True         71.2    0.0
phi4         False        77.8   95.1
             True        100.0  100.0
qwen14b      False        52.2   92.0
             True          4.6  100.0
qwen32b      False        50.0   96.7
             True         39.7   33.9

🎯 PROCESSOR ADVANTAGE ANALYSIS:
------------------------------------------------------------
codestral (With params): PSG 10.0% vs TPUSG 11.5% → TPUSG (+1.5%)
codestral (Without params): PSG 50.0% vs TPUSG 28.6% → PSG (+21.4%)
gemma3:27b (With params): PSG 71.2% vs TPUSG 0.0% → PSG (+71.2%)
gemma3:27b (Without params): PSG 53.3% vs TPUSG 2.3% → PSG (+51.0%)
phi4 (With params): PSG 100.0% vs TPUSG 100.0% → TIE (+0.0%)
phi4 (Without params): PSG 77.8% vs TPUSG 95.1% → TPUSG (+17.3%)
qwen14b (With params): PSG 4.6% vs TPUSG 100.0% → TPUSG (+95.4%)
qwen14b (Without params): PSG 52.2% vs TPUSG 92.0% → TPUSG (+39.8%)
qwen32b (With params): PSG 39.7% vs TPUSG 33.9% → PSG (+5.8%)
qwen32b (Without params): PSG 50.0% vs TPUSG 96.7% → TPUSG (+46.7%)

📊 SUMMARY:
PSG wins: 4/10 configurations
TPUSG wins: 5/10 configurations
Ties: 1/10 configurations
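The advantage lines can be produced by a small helper that always reports the margin as a positive number of percentage points for whichever processor leads. This is an illustrative sketch; `processor_advantage` and `format_advantage` are hypothetical names, not the notebook's functions.

```python
def processor_advantage(psg_rate, tpusg_rate):
    """Return (winner, margin) where margin is a non-negative number of
    percentage points, rounded to one decimal place."""
    diff = round(tpusg_rate - psg_rate, 1)
    if diff > 0:
        return "TPUSG", diff
    if diff < 0:
        return "PSG", -diff
    return "TIE", 0.0


def format_advantage(label, psg_rate, tpusg_rate):
    """Format one comparison line in the style of the log output."""
    winner, margin = processor_advantage(psg_rate, tpusg_rate)
    return (f"{label}: PSG {psg_rate:.1f}% vs TPUSG {tpusg_rate:.1f}% "
            f"→ {winner} (+{margin:.1f}%)")
```

Applied to the ten rate pairs listed above, the helper yields 4 PSG wins, 5 TPUSG wins, and 1 tie.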

📋 COMPLETE COMPARISON TABLE (Traditional + Weighted Success Rates):
----------------------------------------------------------------------------------------------------
    processor  model_config  parameters  total_runs  num_batches  successes  success_rate  efficiency_weighted_rate  exponential_weighted_rate  linear_weighted_rate  robust_weighted_rate  avg_tokens
 0        psg     codestral       False          72          2.4         36         50.00                     25.16                      28.16                 36.67                 42.92    10099.29
 1        psg     codestral        True          90          3.0          9         10.00                      2.67                       2.52                  4.22                  5.33    13197.82
 2        psg    gemma3:27b       False          60          2.0         32         53.33                     30.42                      32.13                 39.67                 44.00     9948.50
 3        psg    gemma3:27b        True          52          1.7         37         71.15                     15.58                      11.99                 19.62                 29.42    12714.37
 4        psg          phi4       False          90          3.0         70         77.78                     37.85                      39.12                 49.56                 56.44     9168.93
 5        psg          phi4        True          90          3.0         90        100.00                    100.00                     100.00                100.00                100.00     2242.97
 6        psg       qwen14b       False          90          3.0         47         52.22                     28.26                      30.20                 37.78                 42.33     9432.88
 7        psg       qwen14b        True          87          2.9          4          4.60                      1.38                       1.37                  2.07                  2.87    11955.85
 8        psg       qwen32b       False          60          2.0         30         50.00                     35.53                      36.00                 40.00                 42.33     9810.17
 9        psg       qwen32b        True         136          4.5         54         39.71                      9.62                       8.19                 12.94                 18.09    13483.35
10      tpusg     codestral       False          56          1.9         16         28.57                     12.08                      12.71                 17.14                 19.82    12682.05
11      tpusg     codestral        True          87          2.9         10         11.49                      3.35                       3.40                  5.75                  6.90    13309.66
12      tpusg    gemma3:27b       False          43          1.4          1          2.33                      2.33                       2.33                  2.33                  2.33    15158.37
13      tpusg    gemma3:27b        True          43          1.4          0          0.00                      0.00                       0.00                  0.00                  0.00    15248.60
14      tpusg          phi4       False          82          2.7         78         95.12                     76.61                      79.49                 85.85                 91.95     4790.41
15      tpusg          phi4        True         120          4.0        120        100.00                    100.00                     100.00                100.00                100.00     2749.41
16      tpusg       qwen14b       False          75          2.5         69         92.00                     70.16                      72.60                 80.00                 85.20     5259.63
17      tpusg       qwen14b        True          30          1.0         30        100.00                    100.00                     100.00                100.00                100.00     2609.00
18      tpusg       qwen32b       False          90          3.0         87         96.67                     72.26                      75.12                 83.33                 90.11     5151.87
19      tpusg       qwen32b        True         118          3.9         40         33.90                      9.11                       8.66                 14.41                 17.80    14093.42
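The notebook output does not show how the four weighted rates are defined. The sketch below illustrates only the general idea of discounting a success by the number of generations it needed, using one plausible linear scheme. The weight formula is an assumption made for illustration and should not be read as the formula behind the `linear_weighted_rate` column.

```python
def linear_weighted_rate(runs, max_generations=5):
    """Success rate in % where each success is discounted by its generation count.

    runs: list of (succeeded, generation_count) tuples.
    ASSUMED weighting (not taken from the notebook): a success on the first
    generation counts 1.0, and the weight falls linearly to 1/max_generations
    for a success on the last allowed generation. Failures count 0.
    """
    if not runs:
        return 0.0
    score = 0.0
    for succeeded, generation_count in runs:
        if succeeded:
            score += (max_generations - generation_count + 1) / max_generations
    return 100.0 * score / len(runs)
```

Any scheme of this shape has the two properties visible in the table: a configuration whose successes all arrive on the first generation keeps a weighted rate equal to its traditional rate, and the weighted rate never exceeds the traditional one.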

Comparison, Visualization, and Insights

Comprehensive visual analysis and strategic recommendations for processor selection.

📊 CREATING PROCESSOR COMPARISON VISUALIZATIONS
============================================================
✅ Processor comparison visualizations created successfully!
📊 Analysis shows clear performance differences between PSG and TPUSG processors

Five Different Success Rate Metrics

🎓 PROCESSOR COMPARISON INSIGHTS
============================================================
📊 OVERALL PERFORMANCE:
Traditional Success Rate:
  PSG Average: 50.9%
  TPUSG Average: 56.0%
  Overall Winner: TPUSG (+5.1%)
Efficiency Weighted:
  PSG Average: 28.6%
  TPUSG Average: 44.6%
  Winner: TPUSG (+15.9%)
Exponential Weighted:
  PSG Average: 29.0%
  TPUSG Average: 45.4%
  Winner: TPUSG (+16.5%)
Linear Weighted:
  PSG Average: 34.3%
  TPUSG Average: 48.9%
  Winner: TPUSG (+14.6%)
Robust Weighted:
  PSG Average: 38.4%
  TPUSG Average: 51.4%
  Winner: TPUSG (+13.0%)

⚙️ PARAMETER EFFECTS:
Traditional Success Rate:
  PSG: With params 45.1% vs Without params 56.7% (Effect: -11.6%)
  TPUSG: With params 49.1% vs Without params 62.9% (Effect: -13.9%)
Efficiency Weighted:
  PSG: With params 25.9% vs Without params 31.4% (Effect: -5.6%)
  TPUSG: With params 42.5% vs Without params 46.7% (Effect: -4.2%)
Exponential Weighted:
  PSG: With params 24.8% vs Without params 33.1% (Effect: -8.3%)
  TPUSG: With params 42.4% vs Without params 48.5% (Effect: -6.0%)
Linear Weighted:
  PSG: With params 27.8% vs Without params 40.7% (Effect: -13.0%)
  TPUSG: With params 44.0% vs Without params 53.7% (Effect: -9.7%)
Robust Weighted:
  PSG: With params 31.1% vs Without params 45.6% (Effect: -14.5%)
  TPUSG: With params 44.9% vs Without params 57.9% (Effect: -12.9%)
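The processor averages in this section appear to be unweighted means over the per-configuration rates (a macro average), which is why PSG's 50.9% here differs from the 49.5% run-level rate reported later (a micro average over all 827 runs). The check below uses the PSG `success_rate` values from the comparison table above; the variable names are illustrative.

```python
from statistics import mean

# PSG success rates (%) per model, in the order
# codestral, gemma3:27b, phi4, qwen14b, qwen32b.
psg_with_params = [10.00, 71.15, 100.00, 4.60, 39.71]
psg_without_params = [50.00, 53.33, 77.78, 52.22, 50.00]

# Macro averages: every configuration counts equally, regardless of run count.
macro_with = mean(psg_with_params)
macro_without = mean(psg_without_params)
parameter_effect = macro_with - macro_without
macro_overall = mean(psg_with_params + psg_without_params)

# Micro average: every run counts equally (409 successes in 827 PSG runs).
micro_overall = 100.0 * 409 / 827

print(f"With params:    {macro_with:.1f}%")
print(f"Without params: {macro_without:.1f}%")
print(f"Effect:         {parameter_effect:+.1f} points")
print(f"Macro overall:  {macro_overall:.1f}%")
print(f"Micro overall:  {micro_overall:.1f}%")
```

Because run counts differ widely between configurations (30 to 136 runs), the two averages answer slightly different questions and should not be compared directly.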

⚙️ PARAMETER USAGE STRATEGY:
   PSG: Parameters hurt performance (-11.6%)
   TPUSG: Parameters hurt performance (-13.9%)
📊 TRADITIONAL vs WEIGHTED SUCCESS RATES COMPARISON
======================================================================
✅ Weighted success rate analysis complete!
📈 The efficiency-weighted metric gives TPUSG a 15.9-point advantage, versus 5.1 points for the traditional success rate.

Success Rates per Model for PSG and TPUSG, With and Without Parameters

📊 DETAILED PROCESSOR COMPARISON BY MODEL & PARAMETERS
======================================================================
📈 SUMMARY COMPARISON TABLE:
--------------------------------------------------------------------------------
     Model PSG+Params   PSG TPUSG+Params TPUSG                     Best_Config
 codestral      10.0% 50.0%        11.5% 28.6%                             PSG
gemma3:27b      71.2% 53.3%         0.0%  2.3%                      PSG+Params
      phi4     100.0% 77.8%       100.0% 95.1% PSG+Params = TPUSG+Params (tie)
   qwen14b       4.6% 52.2%       100.0% 92.0%                    TPUSG+Params
   qwen32b      39.7% 50.0%        33.9% 96.7%                           TPUSG

PSG vs TPUSG Across Multiple Dimensions

📊 RUN-LEVEL STANDARD DEVIATION & STATISTICAL ANALYSIS
================================================================================
Removing 83 rows due to skipped_error set.
Dataset loaded successfully! Shape: (1691, 20)
Cleaned dataset shape: (1571, 16)
Processor distribution: {'psg': 827, 'tpusg': 744}
Date range: 2025-07-29 11:00:38 to 2025-08-20 14:52:35

📊 RUN-LEVEL DATASET STATISTICS:
--------------------------------------------------

PSG Processor:
  Total runs: 827
  Unique batches: 27
  Avg runs per batch: 30.6
  Success rate: 49.5%
  Successful runs: 409
  Failed runs: 418
  Avg generation count: 3.82
  Generation count range: [1, 5]
  Avg generations for success: 2.62

TPUSG Processor:
  Total runs: 744
  Unique batches: 29
  Avg runs per batch: 25.7
  Success rate: 60.6%
  Successful runs: 451
  Failed runs: 293
  Avg generation count: 2.99
  Generation count range: [1, 5]
  Avg generations for success: 1.70

📈 2. RUN-LEVEL METRICS CALCULATION
============================================================
📊 COMPREHENSIVE RUN-LEVEL STATISTICS:
======================================================================

Success Rate (Run Level):
------------------------------------------------------------
  PSG: 49.5% ± 50.0% (n=827 runs)
    95% CI: [46.0%, 52.9%]
    CV: 1.012
  TPUSG: 60.6% ± 48.9% (n=744 runs)
    95% CI: [57.1%, 64.1%]
    CV: 0.807
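The success-rate figures above can be reproduced from the raw counts alone. The sketch assumes a normal-approximation (Wald) interval and a sample standard deviation of the 0/1 outcomes; the notebook's exact method is not shown, but these choices reproduce the printed values. `proportion_summary` is a hypothetical name.

```python
import math


def proportion_summary(successes, n, z=1.96):
    """Summarise a binary outcome: rate, 95% CI, sample SD, and CV.

    Uses the normal-approximation (Wald) interval p ± z * sqrt(p(1-p)/n)
    and the sample standard deviation (n - 1 denominator) of 0/1 values.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    sample_sd = math.sqrt(p * (1 - p) * n / (n - 1))
    return {
        "rate": p,
        "ci": (p - half_width, p + half_width),
        "sd": sample_sd,
        "cv": sample_sd / p,  # coefficient of variation
    }


psg = proportion_summary(409, 827)
tpusg = proportion_summary(451, 744)
```

For a 0/1 variable the standard deviation is fixed by the rate itself, so the "± 50.0%" and the CV carry no information beyond the success rate; the confidence interval is the more useful measure of uncertainty.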

Generation Count (Attempts per Run):
------------------------------------------------------------
  PSG: 3.821 ± 1.605 (n=827 runs)
    95% CI: [3.712, 3.930]
    Range: [1.000, 5.000]
    Median (IQR): 5.000 [2.000, 5.000]
    CV: 0.420
  TPUSG: 2.991 ± 1.836 (n=744 runs)
    95% CI: [2.859, 3.123]
    Range: [1.000, 5.000]
    Median (IQR): 3.000 [1.000, 5.000]
    CV: 0.614

Total Tokens per Run:
------------------------------------------------------------
  PSG: 10292.073 ± 4542.158 (n=827 runs)
    95% CI: [9982.498, 10601.647]
    Range: [2220.000, 17991.000]
    Median (IQR): 12283.000 [5859.000, 13798.000]
    CV: 0.441
  TPUSG: 8733.617 ± 5582.263 (n=744 runs)
    95% CI: [8332.492, 9134.742]
    Range: [2499.000, 16582.000]
    Median (IQR): 8753.500 [2750.000, 15018.500]
    CV: 0.639

Latency per Run (seconds):
------------------------------------------------------------
  PSG: 73.421 ± 45.820 (n=827 runs)
    95% CI: [70.298, 76.544]
    Range: [10.220, 263.980]
    Median (IQR): 67.560 [38.270, 103.505]
    CV: 0.624
  TPUSG: 77.298 ± 61.343 (n=744 runs)
    95% CI: [72.890, 81.706]
    Range: [12.570, 244.300]
    Median (IQR): 56.790 [19.735, 140.567]
    CV: 0.794

🔬 3. STATISTICAL SIGNIFICANCE TESTING (RUN LEVEL)
============================================================
🔬 STATISTICAL TEST RESULTS:
======================================================================

Success Rate (Run Level):
------------------------------------------------------------
  Sample sizes: PSG=827, TPUSG=744
  PSG: 49.5% ± 50.0% (CV: 1.012)
  TPUSG: 60.6% ± 48.9% (CV: 0.807)
  Difference: +11.2% (TPUSG - PSG)
  Chi-square: χ²=19.249, p<0.0001, Significant: Yes
  Effect size (Cohen's h): 0.225 (small)
  Interpretation: TPUSG outperforms PSG
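The chi-square statistic and Cohen's h can be recomputed from the 2x2 table of successes and failures. The sketch is pure Python; it applies Yates' continuity correction, which is an assumption about the notebook's settings but does reproduce the reported χ² of 19.249 (the uncorrected statistic would be about 19.70).

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' correction for the 2x2 table
    [[a, b], [c, d]] (rows: groups, columns: success / failure)."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi_square_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


# PSG: 409 successes, 418 failures; TPUSG: 451 successes, 293 failures.
chi2 = yates_chi_square(409, 418, 451, 293)
p_value = chi_square_p_value_1df(chi2)
h = cohens_h(409 / 827, 451 / 744)
```

Under Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), h ≈ 0.225 is a small effect even though the p-value is far below 0.0001.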

Generation Count (Attempts per Run):
------------------------------------------------------------
  Sample sizes: PSG=827, TPUSG=744
  PSG: 3.821 ± 1.605 (CV: 0.420)
  TPUSG: 2.991 ± 1.836 (CV: 0.614)
  Difference: -0.830 (TPUSG - PSG)
  Mann-Whitney U: U=381602, p<0.0001, Significant: Yes
  T-test: t=9.497, p<0.0001, Significant: Yes
  Effect size (Cohen's d): -0.483 (medium)
  Interpretation: TPUSG needs fewer generation attempts than PSG (medium effect; lower is better)
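Cohen's d for the continuous metrics can be recovered from the summary statistics in this section. The sketch assumes the pooled-standard-deviation form of d, which reproduces the reported values; `cohens_d` is a hypothetical name.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for group B relative to group A, using the pooled SD.

    A negative value means group B has the lower mean.
    """
    pooled_variance = (
        (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    ) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_variance)


# Group A = PSG (n=827), group B = TPUSG (n=744).
d_generations = cohens_d(3.821, 1.605, 827, 2.991, 1.836, 744)
d_tokens = cohens_d(10292.073, 4542.158, 827, 8733.617, 5582.263, 744)
d_latency = cohens_d(73.421, 45.820, 827, 77.298, 61.343, 744)
```

The sign of d only indicates direction (TPUSG minus PSG). Whether a negative value is good or bad depends on the metric: for generation count, tokens, and latency, lower values are the desirable outcome.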

Total Tokens per Run:
------------------------------------------------------------
  Sample sizes: PSG=827, TPUSG=744
  PSG: 10292.073 ± 4542.158 (CV: 0.441)
  TPUSG: 8733.617 ± 5582.263 (CV: 0.639)
  Difference: -1558.456 (TPUSG - PSG)
  Mann-Whitney U: U=309222, p=0.8605, Significant: No
  T-test: t=6.028, p<0.0001, Significant: Yes
  Effect size (Cohen's d): -0.308 (small)
  Interpretation: TPUSG uses fewer tokens on average (small effect), but the difference is significant only under the t-test, not the Mann-Whitney test

Latency per Run (seconds):
------------------------------------------------------------
  Sample sizes: PSG=827, TPUSG=744
  PSG: 73.421 ± 45.820 (CV: 0.624)
  TPUSG: 77.298 ± 61.343 (CV: 0.794)
  Difference: +3.877 (TPUSG - PSG)
  Mann-Whitney U: U=310380, p=0.7606, Significant: No
  T-test: t=-1.407, p=0.1597, Significant: No
  Effect size (Cohen's d): 0.072 (small)
  Interpretation: No significant difference; TPUSG's mean latency is slightly higher

📈 4. RUN-LEVEL VARIANCE ANALYSIS VISUALIZATION
============================================================
STATISTICAL SUMMARY (RUN LEVEL):
--------------------------------
Significant Differences (p < 0.05):

✓ Success Binary: Chi-square p<0.0001 (Cohen's h: 0.225, small)
✓ Generation Count: Mann-Whitney p<0.0001 (Cohen's d: -0.483, medium)
✗ Total Tokens: Mann-Whitney p=0.8605 (Cohen's d: -0.308, small)
✗ Latency: Mann-Whitney p=0.7606 (Cohen's d: 0.072, small)

Sample Sizes:
Total runs: 1571
PSG runs: 827 (52.6%)
TPUSG runs: 744 (47.4%)
Unique batches: 56

📋 5. COMPREHENSIVE RUN-LEVEL ANALYSIS SUMMARY
======================================================================
🎯 RUN-LEVEL PERFORMANCE SUMMARY:
--------------------------------------------------
Dataset Overview:
  Total runs analyzed: 1571
  Unique batches: 56
  Average runs per batch: 28.1
  Date range: 2025-07-29 to 2025-08-20

📊 KEY FINDINGS:
------------------------------
Success Rates (Run Level):
  PSG: 49.5% (409/827 runs)
  TPUSG: 60.6% (451/744 runs)
  Difference: +11.2% (TPUSG - PSG)

Generation Efficiency (Attempts per Run):
  PSG: 3.82 ± 1.60
  TPUSG: 2.99 ± 1.84
  More Efficient: TPUSG

📊 CONSISTENCY COMPARISON (CV):
----------------------------------------
Success Rate: TPUSG more consistent (PSG: 1.012, TPUSG: 0.807)
Generation Count: PSG more consistent (PSG: 0.420, TPUSG: 0.614)
Total Tokens per Run: PSG more consistent (PSG: 0.441, TPUSG: 0.639)
Latency per Run: PSG more consistent (PSG: 0.624, TPUSG: 0.794)

🔬 STATISTICAL SIGNIFICANCE SUMMARY:
--------------------------------------------------
Metrics with significant differences: 2/4
Statistical power: High (large run-level sample sizes)

✅ RUN-LEVEL ANALYSIS COMPLETE!
📊 Analysis based on 1571 individual runs as primary data points
🎯 Each run represents one complete test execution with success/failure outcome
📈 Generation count reflects efficiency of achieving success within each run
🔬 Statistical tests performed at the most granular level for maximum power

Temporal Analysis

📈 BATCH SUCCESS RATE TIMELINE ANALYSIS
======================================================================
🔄 Processing batch timeline data...
📊 Analyzed 56 unique batches
📅 Date range: 2025-07-29 11:00 to 2025-08-20 13:49

📈 CREATING BATCH TIMELINE VISUALIZATION...
📊 BATCH TIMELINE SUMMARY:
--------------------------------------------------
Total unique batches: 56
Unique test days: 12

Processor breakdown:
  PSG batches: 27 (avg success: 47.1%)
  TPUSG batches: 29 (avg success: 60.2%)

🏆 Best performing batch:
  qwen32b_4e11_tpusg: 100.0% (tpusg, qwen32b)

📉 Worst performing batch:
  qwen14b_33b8_psg: 0.0% (psg, qwen14b)

✅ Batch timeline analysis complete!
📈 Timeline shows 56 batches across 21 processor-date combinations
📊 DETAILED BATCH TIMELINE: WITH/WITHOUT PARAMETERS ANALYSIS
======================================================================
🔄 Creating parameter-separated visualizations...
📊 PARAMETER-SPECIFIC BATCH SUMMARY:
------------------------------------------------------------
📈 WITH Parameters:
  Total batches: 28
  PSG: 15 batches (avg: 39.9%)
  TPUSG: 13 batches (avg: 43.8%)
  Best with params: qwen14b_3193_tpusg (100.0%)
  Worst with params: qwen14b_33b8_psg (0.0%)

📉 WITHOUT Parameters:
  Total batches: 28
  PSG: 12 batches (avg: 56.2%)
  TPUSG: 16 batches (avg: 73.5%)
  Best without params: qwen32b_4e11_tpusg (100.0%)
  Worst without params: gemma3:27b_04fd_tpusg (0.0%)

⚙️ PARAMETER EFFECT ANALYSIS:
--------------------------------------------------
qwen32b:
  PSG: 41.2% (with) vs 50.0% (without) → -8.8% effect
  TPUSG: 33.7% (with) vs 96.5% (without) → -62.8% effect
codestral:
  PSG: 10.0% (with) vs 54.8% (without) → -44.8% effect
  TPUSG: 11.7% (with) vs 28.2% (without) → -16.5% effect
phi4:
  PSG: 100.0% (with) vs 73.3% (without) → +26.7% effect
  TPUSG: 100.0% (with) vs 96.3% (without) → +3.7% effect
qwen14b:
  PSG: 4.7% (with) vs 52.2% (without) → -47.5% effect
  TPUSG: 100.0% (with) vs 89.0% (without) → +11.0% effect
gemma3:27b:
  PSG: 73.8% (with) vs 53.3% (without) → +20.5% effect
  TPUSG: 0.0% (with) vs 3.3% (without) → -3.3% effect

✅ Parameter-separated batch timeline analysis complete!
📊 Analysis reveals parameter effects across 56 batches in timeline order

Temporal Trends

📈 20-CONFIGURATION TIMELINE WITH ORGANIZED LINES
================================================================================
🚀 Starting 20-configuration timeline analysis...
🔄 Processing 20-configuration batch timeline data...
📊 Analyzed 56 unique batches
📅 Date range: 2025-07-29 11:00 to 2025-08-20 13:49
🔧 Total configurations found: 20

📋 ORGANIZING 20 CONFIGURATIONS:
============================================================
🔵 PSG Group: 10 configurations
🔴 TPUSG Group: 10 configurations

📈 CREATING 20-LINE ORGANIZED TIMELINE...
📊 Creating timeline with 10 PSG + 10 TPUSG lines
🎨 Plotting PSG configuration lines...
🎨 Plotting TPUSG configuration lines...
📊 20-CONFIGURATION TIMELINE SUMMARY
==================================================


🔵 PSG CONFIGURATIONS:
-----------------------------------
phi4         WithParams 100.0% ( 2b)
gemma3:27b   WithParams  73.8% ( 2b)
phi4         NoParams    73.3% ( 2b)
codestral    NoParams    54.8% ( 3b)
gemma3:27b   NoParams    53.3% ( 2b)
qwen14b      NoParams    52.2% ( 3b)
qwen32b      NoParams    50.0% ( 2b)
qwen32b      WithParams  41.2% ( 5b)
codestral    WithParams  10.0% ( 3b)
qwen14b      WithParams   4.7% ( 3b)

🔴 TPUSG CONFIGURATIONS:
-----------------------------------
phi4         WithParams 100.0% ( 3b)
qwen14b      WithParams 100.0% ( 1b)
qwen32b      NoParams    96.5% ( 4b)
phi4         NoParams    96.3% ( 2b)
qwen14b      NoParams    89.0% ( 6b)
qwen32b      WithParams  33.7% ( 4b)
codestral    NoParams    28.2% ( 2b)
codestral    WithParams  11.7% ( 3b)
gemma3:27b   NoParams     3.3% ( 2b)
gemma3:27b   WithParams   0.0% ( 2b)

🏆 OVERALL AVERAGES:
-------------------------
PSG Average:    51.3%
TPUSG Average:  55.9%
Winner: TPUSG (+4.5%)

✅ 20-LINE TIMELINE COMPLETE!
==================================================
📈 Successfully plotted 20 configuration lines
🔵 PSG lines: 10 (blue colors)
🔴 TPUSG lines: 10 (red colors)
━ Solid lines: With parameters
╌ Dashed lines: Without parameters
📊 Each line shows temporal evolution of one specific configuration
📊 Current Data Structure Analysis
==================================================
✅ batch_timeline_data exists with 56 records
Date range: 2025-07-29 11:00:38 to 2025-08-20 13:49:39

Unique configurations: 20
📈 LOCAL SLOPE AGGREGATION ANALYSIS
================================================================================
🔄 Calculating slopes for each configuration...
📊 Calculated 36 slopes from 20 configurations
📅 Slope date range: 2025-07-29 to 2025-08-20
⏱️ Time spans: 0.01 to 20.17 days
📈 Slope range: -150.83 to 2568.62 %/day

📋 SLOPES SUMMARY BY PROCESSOR:
--------------------------------------------------
🔵 PSG: 17 slopes, mean: 0.396 %/day
🔴 TPUSG: 19 slopes, mean: 137.491 %/day

📊 SAMPLE SLOPES:
----------------------------------------
            config_id start_date   end_date  time_span_days      slope
  tpusg_False_qwen32b 2025-07-29 2025-07-29        0.047130 223.426326
  tpusg_False_qwen32b 2025-07-29 2025-08-03        5.467604  -0.609042
  tpusg_False_qwen32b 2025-08-03 2025-08-04        0.594931   5.597292
tpusg_False_codestral 2025-07-29 2025-08-12       14.257488   0.718920
   tpusg_True_qwen32b 2025-07-30 2025-08-03        4.545972  -8.647215
   tpusg_True_qwen32b 2025-08-03 2025-08-04        0.263056  35.391763
   tpusg_True_qwen32b 2025-08-04 2025-08-04        0.399456 -14.669950
     psg_True_qwen32b 2025-07-30 2025-08-03        4.529919  -0.735112
     psg_True_qwen32b 2025-08-03 2025-08-04        0.262037  62.014134
     psg_True_qwen32b 2025-08-04 2025-08-04        0.401898 -57.029374
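A slope here is the change in batch success rate divided by the elapsed time between two consecutive batches of the same configuration. A minimal sketch, assuming each batch is represented by a timestamp and a success rate in percent; `consecutive_slopes` is a hypothetical name.

```python
from datetime import datetime


def consecutive_slopes(points):
    """Slopes (%/day) between consecutive batches of one configuration.

    points: list of (timestamp, success_rate_percent) tuples, where
    timestamp is a datetime. Pairs with a zero time span are skipped.
    """
    ordered = sorted(points)
    slopes = []
    for (t0, r0), (t1, r1) in zip(ordered, ordered[1:]):
        span_days = (t1 - t0).total_seconds() / 86400.0
        if span_days <= 0:
            continue  # identical timestamps: slope is undefined
        slopes.append({
            "start": t0,
            "end": t1,
            "time_span_days": span_days,
            "slope": (r1 - r0) / span_days,
        })
    return slopes
```

Because the time span is the denominator, two batches run minutes or hours apart produce very large slopes from modest rate changes. That would account for extremes such as 2568.62 %/day and for the inflated TPUSG mean slope in the per-processor summary.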

============================================================
🔄 Aggregating slopes by adjacent day pairs...
📅 Found 23 unique dates in slope data
📊 Created 22 aggregated day-pair slope measurements
🔵 PSG: 21 day-pairs with data
🔴 TPUSG: 22 day-pairs with data
📈 PSG slope range: -4.12 to 23.11 %/day
📈 TPUSG slope range: -1.97 to 7.04 %/day

📊 AGGREGATED SLOPES PREVIEW:
--------------------------------------------------
day_pair_start day_pair_end  psg_mean_slope  tpusg_mean_slope  psg_slope_count  tpusg_slope_count
    2025-07-29   2025-07-30             NaN          0.054939                0                  2
    2025-07-30   2025-07-31       -0.496120         -1.330875                4                  6
    2025-07-31   2025-08-01       -0.496120         -1.330875                4                  6
    2025-08-01   2025-08-02       -0.496120         -1.330875                4                  6
    2025-08-02   2025-08-03       -0.496120         -1.330875                4                  6
    2025-08-03   2025-08-04       15.191192          7.043344                4                  6
    2025-08-04   2025-08-05       -4.116323          0.317752                4                  4
    2025-08-05   2025-08-06       -1.094686          0.211835                6                  6
    2025-08-06   2025-08-07       -1.094686          0.211835                6                  6
    2025-08-07   2025-08-08       -1.094686          0.211835                6                  6
📈 CREATING AGGREGATED SLOPE VISUALIZATION...
============================================================
🎨 Creating aggregated slope trends visualization...
📊 AGGREGATED SLOPE ANALYSIS SUMMARY
============================================================

🎯 METHODOLOGY:
1. Calculate slopes between consecutive points for each config
2. For each day-pair, average slopes of all configs covering it
3. Plot aggregated trend (NOT success rate, but rate of change)
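Step 2 of the methodology can be sketched as follows. The coverage rule (a slope segment counts for a day-pair when it starts on or before the first day and ends on or after the second) is inferred from the preview table, where the same-day segment of 223.43 %/day is evidently excluded from the first TPUSG day-pair; it is an assumption, and `aggregate_by_day_pair` is a hypothetical name.

```python
from datetime import date, timedelta
from statistics import mean


def aggregate_by_day_pair(slopes):
    """Average the slopes of all segments covering each adjacent day-pair.

    slopes: list of (start_date, end_date, slope) tuples for one processor.
    Returns {(day, next_day): mean slope, or None if nothing covers the pair}.
    """
    if not slopes:
        return {}
    first = min(start for start, _, _ in slopes)
    last = max(end for _, end, _ in slopes)
    result = {}
    day = first
    while day < last:
        next_day = day + timedelta(days=1)
        # A segment covers the pair only if it spans both calendar days,
        # so segments that start and end on the same date never qualify.
        covering = [s for start, end, s in slopes
                    if start <= day and end >= next_day]
        result[(day, next_day)] = mean(covering) if covering else None
        day = next_day
    return result
```

With the three TPUSG sample slopes that start on 2025-07-29, this rule gives a mean of about 0.0549 %/day from 2 segments for the first day-pair, in line with the preview table. A long segment contributes the same slope to every day-pair it spans, which is why several consecutive rows in the preview repeat identical values.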

📈 SLOPE TREND COMPARISON:
-----------------------------------
PSG Average Slope:     2.36 ± 6.11 %/day
TPUSG Average Slope:  -0.00 ± 1.71 %/day
Difference:           -2.36 %/day (TPUSG - PSG)

🏆 OVERALL TRENDS:
-------------------------
PSG Trend:   Improving (+2.36 %/day)
TPUSG Trend: Stable (-0.00 %/day)

📊 DATA COVERAGE:
--------------------
PSG day-pairs:   21/22
TPUSG day-pairs: 22/22
Date range:      2025-07-29 to 2025-08-20

Avg configs per day-pair:
  PSG: 4.6 configurations
  TPUSG: 4.8 configurations

✅ AGGREGATED SLOPE ANALYSIS COMPLETE!
==================================================
📊 Successfully analyzed 22 day-pair measurements
🎯 Key insight: Y-axis shows rate of change (%/day), not success rate
📈 Positive slopes = performance improving over time
📉 Negative slopes = performance declining over time
➖ Zero slope = stable performance